
Conversation

mhaseeb123
Member

@mhaseeb123 mhaseeb123 commented Sep 23, 2025

Description

Follow-up to #19986.

This PR reduces the output column buffer sizes needed to materialize columns with a list parent (such as list<list<...>>, list<str>, and list<list<..<str>..>>) when some Parquet pages have been pruned in the next-gen reader. Doing so also eliminates non-empty nulls across list hierarchies, which speeds up their materialization.
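To illustrate the idea, here is a minimal host-side sketch of sizing output buffers with a page mask. It is not the libcudf implementation: `page_summary`, `buffer_sizes`, and `compute_buffer_sizes` are hypothetical names, and the sketch assumes that a pruned page still contributes its rows at the list level (as nulls) but contributes no child values, so those null rows stay empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-page summary (illustrative only; not libcudf's PageInfo).
struct page_summary {
  std::size_t num_list_rows;    // rows this page produces at the list level
  std::size_t num_leaf_values;  // child values (e.g. strings) under those rows
};

// Hypothetical result: element counts used to size the output buffers.
struct buffer_sizes {
  std::size_t list_rows;
  std::size_t leaf_values;
};

// Sizes the list-level and leaf-level buffers for one column chunk.
// Assumption: page_mask[i] == false means page i was pruned.
inline buffer_sizes compute_buffer_sizes(std::vector<page_summary> const& pages,
                                         std::vector<bool> const& page_mask)
{
  assert(pages.size() == page_mask.size());
  buffer_sizes sizes{0, 0};
  for (std::size_t i = 0; i < pages.size(); ++i) {
    // Every page keeps its row slots so row indices stay aligned.
    sizes.list_rows += pages[i].num_list_rows;
    // Only pages that will actually be decoded need space for child values.
    if (page_mask[i]) { sizes.leaf_values += pages[i].num_leaf_values; }
  }
  return sizes;
}
```

For example, with three pages holding 5, 7, and 4 leaf values and the middle page pruned, the leaf buffer in this sketch needs room for 9 values instead of 16, and the rows of the pruned page point at an empty child range.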

Checklist


copy-pr-bot bot commented Sep 23, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf label Sep 23, 2025
@mhaseeb123 mhaseeb123 added the 2 - In Progress, tests, cuIO, strings, improvement, and non-breaking labels Sep 23, 2025
return std::pair{std::move(table), std::move(buffer)};
}

/**
Member Author

@mhaseeb123 mhaseeb123 Sep 23, 2025

All of this was simply moved as-is from hybrid_scan_test.cpp. No need to review.

return cudf::test::strings_column_wrapper(elements, elements + num_ordered_rows);
}

std::unique_ptr<cudf::table> concatenate_tables(std::vector<std::unique_ptr<cudf::table>> tables,
Member Author

@mhaseeb123 mhaseeb123 Sep 23, 2025

All of this was simply moved as-is from hybrid_scan_test.cpp. No need to review.

@mhaseeb123 mhaseeb123 marked this pull request as ready for review September 29, 2025 22:04
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner September 29, 2025 22:04
* @param page_mask Page mask indicating if this column needs to be decoded
* @param min_rows crop all rows below min_row
* @param num_rows Maximum number of rows to read
* other settings and records the result in the PageInfo::str_bytes_all field
Member Author

stale comments

@mhaseeb123 mhaseeb123 changed the title from "Reduce output buffer sizes for pruned pages of compound columns with a list parent" to "Reduce output buffer sizes for pruned pages of columns with a list parent" Sep 29, 2025
@mhaseeb123 mhaseeb123 added the 4 - Needs Review label and removed the 3 - Ready for Review label Sep 30, 2025
mhaseeb123 and others added 2 commits October 7, 2025 18:25
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
@mhaseeb123
Member Author

pre-commit.ci autofix

@mhaseeb123 mhaseeb123 added the 5 - Ready to Merge label and removed the 4 - Needs Review label Oct 8, 2025
@mhaseeb123
Member Author

/ok to test a471b6e

@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 805043e into rapidsai:branch-25.12 Oct 8, 2025
132 checks passed
@mhaseeb123 mhaseeb123 deleted the fea/reduce-output-buffer-sizes-for-pruned-pages branch October 8, 2025 20:06
rapids-bot bot pushed a commit that referenced this pull request Oct 15, 2025
Follow-up to #20086 and #19986.

This PR enables skipping decompression of Parquet data pages marked as pruned in the new experimental Parquet reader. It also zeroes out the nesting size information (used to allocate output buffers) for pruned pages at the point where it is computed, instead of resetting it later, just before buffer allocation (as in #20086).

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - https://github.com/nvdbaranec
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #20192
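
The two changes described in this commit message can be sketched together in a single pass over the pages. This is only an illustration under stated assumptions, not the reader's actual code: `page_desc`, `page_plan`, and `plan_pages` are hypothetical names, and the per-depth `nesting_sizes` vector stands in for whatever nesting size information the reader computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Parquet data page descriptor.
struct page_desc {
  bool is_compressed;
  std::vector<std::size_t> nesting_sizes;  // value count per nesting depth
};

// Hypothetical plan produced before decompression and buffer allocation.
struct page_plan {
  std::vector<std::size_t> pages_to_decompress;         // indices into the input
  std::vector<std::vector<std::size_t>> nesting_sizes;  // per page, per depth
};

// Assumption: page_mask[i] == false means page i was pruned.
inline page_plan plan_pages(std::vector<page_desc> const& pages,
                            std::vector<bool> const& page_mask)
{
  assert(pages.size() == page_mask.size());
  page_plan plan;
  plan.nesting_sizes.reserve(pages.size());
  for (std::size_t i = 0; i < pages.size(); ++i) {
    if (page_mask[i]) {
      // Kept page: schedule decompression if needed and record its real sizes.
      if (pages[i].is_compressed) { plan.pages_to_decompress.push_back(i); }
      plan.nesting_sizes.push_back(pages[i].nesting_sizes);
    } else {
      // Pruned page: never decompressed, and its sizes are zeroed right here
      // rather than in a later reset pass before allocation.
      plan.nesting_sizes.push_back(
        std::vector<std::size_t>(pages[i].nesting_sizes.size(), 0));
    }
  }
  return plan;
}
```

Zeroing the sizes where they are computed means later stages can consume them directly without a separate reset step.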

Labels

5 - Ready to Merge, cuIO, improvement, libcudf, non-breaking, strings, tests

Projects

Status: No status

5 participants